34 research outputs found

    Doctor of Philosophy

    With the explosion of chip transistor counts, the semiconductor industry has struggled to continue scaling computing performance in line with historical trends. In recent years, the de facto way to use excess transistors has been to increase the size of the on-chip data cache, allowing fast access to a larger portion of main memory. These large caches allowed the continued scaling of single-thread performance, which had not yet reached the limit of instruction-level parallelism (ILP). As we approach the potential limits of parallelism within a single-threaded application, new approaches such as chip multiprocessors (CMPs) have become popular for scaling performance by exploiting thread-level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where single-threaded and multithreaded performance have often been ignored by computer architects. We propose that novel hardware and OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads, and that this OS contribution grows to a significant fraction of total instructions when executing several common datacenter applications. We demonstrate that architectural improvements have had little to no effect on the performance of the OS over the last 15 years, leaving ample room for improvement. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider a separate operating system processor (OSP) operating concurrently with general purpose processors (GPPs) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors. Second, we consider segregating existing caching structures to decrease cache interference between the OS and the application. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache-topology aware, which, in turn, improves the performance and scalability of many-threaded applications.
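
    The claim that the OS contributes a nontrivial share of execution can be sanity-checked on a stock Linux machine. The sketch below is not the dissertation's methodology; it samples the system-wide per-mode counters in /proc/stat around a command and reports the kernel-mode share of busy CPU time, a coarse time-based proxy for the privileged-instruction fraction discussed above (the script name and structure are assumptions for illustration).

        # os_share.py -- rough proxy for OS overhead: kernel-mode share of busy CPU time.
        # Usage: python os_share.py <command> [args...]   (measures system-wide activity)
        import subprocess
        import sys

        def cpu_times():
            """Return (user+nice, system+irq+softirq) jiffies from the aggregate cpu line."""
            with open("/proc/stat") as f:
                fields = f.readline().split()[1:]
            user, nice, system, idle, iowait, irq, softirq = map(int, fields[:7])
            return user + nice, system + irq + softirq

        u0, s0 = cpu_times()
        subprocess.run(sys.argv[1:], check=True)    # run the workload under test
        u1, s1 = cpu_times()

        busy = (u1 - u0) + (s1 - s0)
        share = (s1 - s0) / busy if busy else 0.0
        print(f"kernel-mode share of busy CPU time: {share:.1%}")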

    Interference aware cache designs for operating system execution

    Large-scale chip multiprocessors will likely be heterogeneous. Several groups have suggested that it may be worthwhile to implement some cores that are specially tuned to execute common code patterns. One such common application that will execute on all future processors is, of course, the operating system. Many future workloads will spend a large fraction of their execution time in privileged mode, either executing system calls or pure operating system functionality. Vast transistor budgets and relatively low on-chip communication latencies make it feasible to off-load the execution of privileged instruction sequences onto such a custom core. In this paper, we first examine this off-load approach and attempt to understand its benefits. We then try to architect a solution that captures the benefits of off-loading and eliminates its disadvantages. In essence, the benefits of off-loading can be attributed to reduced cache interference, while its disadvantages are the high latency costs of off-load and cache coherence. Our proposed solution employs a special OS cache per core and improves performance by up to 18% for OS-intensive workloads without any significant addition of transistors. We consider several design choices for this OS cache and argue that it is a better use of the transistor and power budget than the off-loading approach, whether the transistor budget is increased or left unchanged.
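
    To make the cache-segregation idea concrete, the toy model below (a minimal sketch, not the paper's simulator; all sizes and class names are invented for illustration) routes privileged-mode accesses to a small dedicated OS cache next to the regular L1, so OS lines stop evicting application lines and vice versa.

        # Toy illustration of segregating OS and application accesses into separate caches.
        class DirectMappedCache:
            def __init__(self, num_sets, block_bytes=64):
                self.num_sets, self.block_bytes = num_sets, block_bytes
                self.tags = [None] * num_sets       # one tag per set

            def access(self, addr):
                """Return True on a hit; install the block on a miss."""
                block = addr // self.block_bytes
                index, tag = block % self.num_sets, block // self.num_sets
                hit = self.tags[index] == tag
                if not hit:
                    self.tags[index] = tag
                return hit

        class SplitL1:
            """L1 front end with a small dedicated OS cache for privileged-mode accesses."""
            def __init__(self):
                self.app_cache = DirectMappedCache(num_sets=512)   # regular L1 capacity
                self.os_cache = DirectMappedCache(num_sets=64)     # small OS-only cache

            def access(self, addr, privileged):
                target = self.os_cache if privileged else self.app_cache
                return target.access(addr)

    Replaying an interleaved trace of application and system-call addresses through SplitL1 and through a single shared cache of the same total capacity gives a rough feel for how much benefit comes purely from removing OS/application interference.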

    A case for increased operating system support in chip multi-processors

    We identify the operating system as one area where a novel architecture could significantly improve on current chip multi-processor designs, allowing increased performance and improved power efficiency. We first show that the operating system contributes a non-trivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing interactive applications. We then show that architectural improvements have had little to no effect on the performance of the operating system over the last 15 years. Based on these observations, we propose the need for increased operating system support in chip multiprocessors. Specifically, we consider the potential of a separate Operating System Processor (OSP) operating concurrently with General Purpose Processors (GPPs) in a Chip Multi-Processor (CMP) organization.
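
    The OSP/GPP split described above can be mimicked in software to show the shape of the idea. The sketch below is a conceptual stand-in only; the conduits in the proposal are hardware structures, and the names and queue-based plumbing here are invented. General purpose workers post privileged requests into a shared conduit that a dedicated OS worker drains, so application cores never execute the handler themselves.

        # Conceptual model of off-loading privileged work from GPPs to a dedicated OSP
        # through a shared request conduit (software queues stand in for hardware).
        import queue
        import threading

        conduit = queue.Queue()                    # GPP -> OSP request channel

        def osp_core():
            """Dedicated OS core: services privileged requests posted by the GPPs."""
            while True:
                request, reply = conduit.get()
                if request is None:                # shutdown sentinel
                    break
                reply.put(f"handled privileged request {request}")

        def gpp_syscall(number):
            """A GPP 'traps': it posts the request and blocks until the OSP replies."""
            reply = queue.Queue(maxsize=1)
            conduit.put((number, reply))
            return reply.get()

        osp = threading.Thread(target=osp_core, daemon=True)
        osp.start()
        print(gpp_syscall(64))                     # off-load one hypothetical request
        conduit.put((None, None))                  # let the OSP thread exit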

    Beyond the socket: NUMA-aware GPUs

    GPUs achieve high throughput and power efficiency by employing many small single instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance variance, they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs, where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies and minimize the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket. We would like to thank the anonymous reviewers and Steve Keckler for their help in improving this paper. The first author is supported by the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925).
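
    One way to picture the per-socket adaptation described above is a small controller that watches how much of a socket's recent traffic goes to remote memory and flips its caching policy at phase boundaries. The sketch below is only an illustration of that control loop; the window size, threshold, and class name are arbitrary placeholders, not the mechanism evaluated in the paper.

        # Illustrative per-socket controller: adapt the local caching policy to the
        # observed remote-access fraction of the current application phase.
        from collections import deque

        class SocketPolicyController:
            def __init__(self, window=10_000, remote_threshold=0.2):
                self.samples = deque(maxlen=window)   # sliding window of recent accesses
                self.remote_threshold = remote_threshold
                self.cache_remote_data = False        # current policy for this socket

            def observe(self, is_remote):
                """Record one memory access (True if it went to a remote socket)."""
                self.samples.append(is_remote)

            def update_policy(self):
                """Called at phase boundaries; cache remote lines when they dominate."""
                if self.samples:
                    remote_fraction = sum(self.samples) / len(self.samples)
                    self.cache_remote_data = remote_fraction > self.remote_threshold
                return self.cache_remote_data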

    Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

    Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data-parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to a loss of statistical efficiency, i.e., a larger number of epochs is required to converge to a desired accuracy. These factors affect overall training time, and beyond a certain number of devices, the speedup from leveraging DP begins to scale poorly. In addition to DP, each training step can be accelerated by exploiting model parallelism (MP). This work explores hybrid parallelization, where each data-parallel worker comprises more than one device, across which the model dataflow graph (DFG) is split using MP. We show that at scale, hybrid training will be more effective at minimizing end-to-end training time than exploiting DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%, and 22% respectively, compared to what DP alone can achieve at scale.
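
    The trade-off the abstract describes, i.e. DP's growing communication and statistical-efficiency costs versus MP's per-step acceleration, can be sketched with a back-of-the-envelope model. The constants below are invented placeholders, not the paper's measurements or projections; the point is only that beyond some device count the hybrid configuration wins on end-to-end time.

        # Toy end-to-end-time model: pure data parallelism (DP) versus a hybrid in which
        # each DP worker is a small model-parallel (MP) group. All constants are invented.
        def end_to_end_time(devices, mp_degree, compute_per_step=1.0, allreduce_cost=0.02,
                            mp_speedup=1.6, base_epochs=90, stat_loss_per_worker=0.01):
            dp_workers = devices // mp_degree
            # Per-step time: compute (sped up by MP within a worker) plus a gradient
            # all-reduce whose cost grows with the number of DP workers.
            step = compute_per_step / (mp_speedup if mp_degree > 1 else 1.0) \
                   + allreduce_cost * dp_workers
            # Statistical efficiency loss: larger aggregate batch => more epochs needed.
            epochs = base_epochs * (1 + stat_loss_per_worker * dp_workers)
            steps_per_epoch = 1000 / dp_workers     # fixed dataset; batch scales with workers
            return step * steps_per_epoch * epochs

        for devices in (8, 32, 128):
            print(f"{devices:4d} devices  pure DP: {end_to_end_time(devices, 1):8.1f}"
                  f"   hybrid (MP=2): {end_to_end_time(devices, 2):8.1f}")

    With these placeholder constants the model reproduces the qualitative behavior above: DP alone is faster at small device counts, while the hybrid configuration pulls ahead once communication and statistical-efficiency costs dominate.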